Exploratory Data Analysis of a “White Wine” data set by Manuhar Chopra

This data set consists of a number of results from objective tests (e.g. pH,alcohol,density) on White wines. It also includes a quality rating, which is based on sensory data (median of at least 3 evaluations made by wine experts) between 0 (very bad) and 10 (very excellent).

The citation of the data set is as below

Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

      [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
      
      [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
      
      [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Univariate Plots Section

First, performing some preliminary exploration of the data set to understand the structure and view the summaries of different variables

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

In the data set, the mean quality is 5.878, with a median quality of 6. Density and pH appear to be less dispersed than other variables, with their mean & median rather close.

To better visualize the distribution of quality, a bar chart is plotted below, with the color also reflecting the quality for better visualization. Also, creating a new variable (quality.f), which comes from factoring the original quality variable. This transformation may be useful in the future as it is now a discrete variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Matching with our expectations, 6 has the highest count of wines.

Further analyzing the variables, it can be seen that both total and free sulfur dioxide are present in the data set. However, it seems that “bound” sulfur dioxide (total-free) is missing. Hence, this additional variable is created below, such that

bound.sulfur.dioxide = total.sulfur.dioxide-free.sulfur.dioxide

The summary of this new variable is as below

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

Now, to visually explore each of the other variables, they were plotted using grid arrange

Most of the plots seem to roughly follow a normal distribution, with the exception of residual sugar. Perhaps transforming it with the log function will result in a better trans formation.

Trying out the log transform on the residual sugar variable below to address the long tail

The residual sugar histogram now shows a bi-modal pattern. This data might be useful in further analysis in the following sections

Checking below to see how many wines have total SO2 > 50, which is the threshold value when it starts becoming noticeable in the taste

## [1] 0.9899959

Most of the wines (99%) in this database have a high SO2 content, which should be noticeable in their taste.Lets see in the further sections if this is valid

For most wines, pH should between 3 and 4. Lets see below how many wines fall outside this expected value range of pH

## [1] 0.08922009

Less than .01% lie outside the 3-4 range, with the most clustered between 2.9 and 3.

However, there seem to be some outliers in the data. Outliers can reduce the robustness of data analysis and can induce bias.Representing outliers with the graphs below

Although outliers can be clearly seen, they are not deleted from the data set as they could potentially hold some insights into the data.

Creating a final iteration of the grid of the variables, not plotting the outliers (bottom & top 1%) for all variables

Univariate Analysis

What is the structure of your dataset?

There are 11 quantitative features and 4,898 observations in the data set.

The final variable is quality which was converted to a factor, since it provides a rating from 0 to 10.

The actual quality varies from 3 to 9, with 6 as median and 5.88 as mean.

What is/are the main feature(s) of interest in your dataset?

A useful final model should be able to predict the “quality” of the wine. Hence, that is the key feature of interest.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Many of the other features likely have an impact on the final quality. For example alcohol,density, pH appear to be of interest

Did you create any new variables from existing variables in the dataset?

As the data only included free and total sulfur dioxide (sum of free & bound), I created a new variable - “bound sulfur dioxide” = Total - Free Sulfur Di oxide.

This variable may provide additional insights going forward in the analysis.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I made the following changes in the data set

  1. Added a new variable, quality.f, which is a factor of the original quality variable, as the ratings can be better analyzed as discrete values.

  2. The sugar and alcohol data did not appear to be normally distributed among all the other variables.Hence, I used the Log transformation on the residual sugar and alcohol values. Thus they became closer to a normal distribution

Bivariate Plots Section

As a first step, the ggpairs function was used to get an overview of the various relationships existing between pairs of variables. The X and the quality.f variables were excluded to have a cleaner analysis.

Some of the variable names have been shortened to increase readability, e.g. b.SO2 = bound sulfur dioxide

## 'data.frame':    4898 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.f           : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ bound.sulfur.dioxide: num  125 118 67 139 139 67 106 125 118 101 ...

The highest correlations with quality (positive or negative) which came from the above result are listed below

Quality and alcohol: 0.436

Quality and density: -0.307

Quality and Bound Sulfur Dioxide: -0.218

Quality and chlorides: -0.210

Quality and Total Sulfur Di Oxide :-.175

None of the values point towards a strong correlation !

In order to further analyse these variables, I am plotting below a scatter plot with for alcohol vs Quality. Including jitter, alpha & removing 1% outliers on both ends for better visualization.

Additionally, adding a geom_smooth overlay on the data to get a sense of the variation

## [1] 0.4355747

The o.436 correlation can be seen, as higher % of alcohol is leading to a higher quality score within our given range.

However, there seems to be a slight dip in quality around 9-10 % alcohol. Since alcohol has the strongest correlation, analyzing this dip in detail in the plot below. The X & Y axis have been interchanged, and a stat_summary line plots the mean alcohol % for a given quality

It appears there is an inflection point at quality =5, where the line changes slope. I.e. this means that for lower quality wine (<5), the correlation is negative and for higher quality wine, the correlation is positive.

Trying to find these correlation factors below

## [1] -0.05862333  0.46756642

Hence, as it was suspected, the correlation between alcohol & quality is -0.0586 if quality is less than 5. If the quality is equal or more than 5, then the correlation is 0.467. This means that a simple linear model will not be very accurate. A conditional or non linear model may be more suitable.

Plotting similar comparison histograms for density, bound sulfur dioxide, chlorides and total sulfur dioxide below

The following were the key takeaways from the above charts

1.Density and quality have a weak negative correlation (-0.3), with higher densities associated with lower quality

2.Bound Sulfur Dioxide (-0.22)& Chlorides (-0.21) have very weak negative correlations, with quality generally falling as values of these variables increase

3.Total Sulfur Dioxide and quality has the weakest negative correlation (-0.175) among the pairs which we selected. It is weak enough to not be able to draw conclusive results

However, a point to note is that for both bound & total sulfur dioxide, there is a weak positive correlation which then turns negative at around 5 quality level.This could be because at that levels it starts to negatively impact taste

The other variable pairs which had a high correlation among them are as below

Total sulfur dioxide and bound sulfur dioxide: 0.922

Density and residual sugar: 0.839

Alcohol and density: -0.780

Total sulfur dioxide and free sulfur dioxide: 0.616

Bound sulfur dioxide and density: 0.504

Clearly, sulfur-dioxide levels (total vs bound or total vs free) will be strongly correlated to each other. Leaving those aside , we get Density VS Alcohol, Density VS sugar & Density VS Bound SO2

Alcohol and density have a strong negative correlation (-0.780). This may seem logical as alcohol is volatile liquid with a low density. Higher alcohol % would then tend to reduce the overall density of the wine

Sugar, has a bit the reverse effect of alcohol. It is strongly positively correlated (0.839) with density. This is probably as sugar itself is rather dense, its higher amounts increases the overall wine density

Bound SO2 has a largely positive correlation (0.504), though not as strong as sugar.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the data set?

The feature of interest (quality) has weak correlation with alcohol (0.436), density (-.307),bound sulfur dioxide (-0.218), chlorides(-0.210) & Total Sulfur dioxide variables. (-0.175)

1.With Alcohol increasing from 3 to 5, the quality decreased. As the alcohol increased from 5 to 9, the quality increased.

2.Quality weakly decreases with an increase in either of the remaining four variables, i.e. density, bound sulfur dioxide,chlorides and total sulfur

However, a point to note is that for both bound & total sulfur dioxide, there is initially a weak positive correlation which then turns negative around 70-80 ppm. This could be because at that levels it starts to negatively impact taste

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There were strong correlations between the following variables

Total sulfur dioxide and bound sulfur dioxide: 0.922

Density and residual sugar: 0.839

Alcohol and density: -0.780

Total sulfur dioxide and free sulfur dioxide: 0.616

Bound sulfur dioxide and density: 0.504

The relationships between Free, bound & total sulfur seem obvious. Besides those density was strongly positively linked with residual sugar (sugar has high density) and negatively linked with alcohol (alcohol has low density). Density was also weakly positively linked with sulfur dioxide

What was the strongest relationship you found?

Excluding the correlation between total & bound sulfur (as they are related variables), the best correlation was between density and residual sugar.

Multivariate Plots Section

The last section brought to light some interesting relationships between the various variable pairs. In this section, I will deep dive into some of those relationships, adding adding data from additional variables to look for further insights

Plotting below the alcohol vs residual sugar,where color depends on quality

We can see that residual sugar is higher at lower alcohol levels, and lower at higher alcohol levels. This also fits with the understood logic that the sugars are fermented to create alcohol in the process of wine making. Also, higher quality wines tend to have higher amounts of alcohol

Plotting below to check if a similar relationship exists between quality, density & residual sugar

The plot above reflects the findings of two of our earlier discussion points. Quality is generally falling as density increases, and that higher densities are associated with higher residual sugar levels

Plotting below to check if a similar relationship exists between quality, bound sulfur dioxide and residual sugar

This graph shows that elevated levels of bound SO2 tend to be positively correlated with higher amounts of residual sugar. This sounds logical as sugar is known to bind with sulfur to create bound SO2.Also, better quality is generally associated lower levels of bound sulfur dioxide

On the same topic, plotting the ratio of bound So2 / total SO2 with residual sugar

The graph shows some interesting results. At most low levels of quality, its is showing a negative correlation between residual sugar and proportion of bound So2. But at most high levels of quality, this correlation seems to become positive ! As they say, correlation does not imply causation, but this is still a point to note

Leaving aside quality, the other highly correlated features were alcohol, density and residual sugar. To better understand their distribution, I have distributed these variables below into “buckets”

Alcohol and density are negatively correlated, with high sugar contributing to higher densities

Faceting this graph by the quality levels yields the result below

High quality (quality >6) wines tend be more in the bottom right corner, with higher alcohol and lower densities. Quality 9 wines are in fact mostly low sugar, low density and high alcohol !

A linear model was attempted from this data, to predict wine qualities. Starting with the highest correlation variable (alcohol), each of the other variables were added one by one to select the best model

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = df)
## m2: lm(formula = quality ~ alcohol + density, data = df)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = df)
## m4: lm(formula = quality ~ alcohol + density + chlorides + pH, data = df)
## m5: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide, 
##     data = df)
## m6: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity, 
##     data = df)
## m7: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     total.sulfur.dioxide, data = df)
## m8: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity + 
##     residual.sugar, data = df)
## m9: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     total.sulfur.dioxide + citric.acid, data = df)
## m10: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity + 
##     residual.sugar + sulphates, data = df)
## m11: lm(formula = quality ~ alcohol + density + chlorides + bound.sulfur.dioxide + 
##     total.sulfur.dioxide + citric.acid + free.sulfur.dioxide, 
##     data = df)
## m12: lm(formula = quality ~ alcohol + density + chlorides + pH + fixed.acidity + 
##     residual.sugar + sulphates + volatile.acidity, data = df)
## 
## ================================================================================================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11           m12       
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               2.582***    -22.492***    -21.150***    -21.956***    -28.574***    -30.947***    -21.620***    163.481***    -20.523**     204.084***    -20.523**     156.891***  
##                            (0.098)       (6.165)       (6.162)       (6.160)       (6.437)       (6.378)       (6.416)      (18.118)       (6.465)      (18.788)       (6.465)      (18.407)    
##   alcohol                   0.313***      0.360***      0.343***      0.340***      0.341***      0.354***      0.344***      0.146***      0.342***      0.101***      0.342***      0.183***  
##                            (0.009)       (0.015)       (0.015)       (0.015)       (0.015)       (0.016)       (0.015)       (0.024)       (0.015)       (0.024)       (0.015)       (0.024)    
##   density                                24.728***     23.671***     23.675***     31.315***     33.752***     24.098***   -163.937***     22.979***   -205.139***     22.979***   -156.909***  
##                                          (6.079)       (6.074)       (6.067)       (6.369)       (6.348)       (6.353)      (18.375)       (6.405)      (19.055)       (6.405)      (18.673)    
##   chlorides                                            -2.382***     -2.286***     -2.245***     -2.393***     -2.303***     -1.034        -2.378***     -0.886        -2.378***     -0.099     
##                                                        (0.558)       (0.558)       (0.558)       (0.557)       (0.553)       (0.562)       (0.555)       (0.559)       (0.555)       (0.544)    
##   pH                                                                  0.261***                    0.069                       0.866***                    0.912***                    0.709***  
##                                                                      (0.076)                     (0.084)                     (0.108)                     (0.108)                     (0.105)    
##   bound.sulfur.dioxide                                                             -0.001***                   -0.009***                   -0.009***                   -0.009***                
##                                                                                    (0.000)                     (0.001)                     (0.001)                     (0.001)                  
##   fixed.acidity                                                                                  -0.082***                    0.082***                    0.110***                    0.065**   
##                                                                                                  (0.016)                     (0.021)                     (0.021)                     (0.021)    
##   total.sulfur.dioxide                                                                                          0.007***                    0.007***                    0.007***                
##                                                                                                                (0.001)                     (0.001)                     (0.001)                  
##   residual.sugar                                                                                                              0.084***                    0.100***                    0.087***  
##                                                                                                                              (0.007)                     (0.008)                     (0.007)    
##   citric.acid                                                                                                                               0.130                       0.130                   
##                                                                                                                                            (0.095)                     (0.095)                  
##   sulphates                                                                                                                                               0.783***                    0.661***  
##                                                                                                                                                          (0.103)                     (0.100)    
##   volatile.acidity                                                                                                                                                                   -1.942***  
##                                                                                                                                                                                      (0.110)    
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.195         0.197         0.198         0.202         0.213         0.223         0.213         0.232         0.213         0.278     
##   adj. R-squared            0.190         0.192         0.195         0.197         0.197         0.201         0.212         0.222         0.212         0.231         0.212         0.277     
##   sigma                     0.797         0.796         0.795         0.794         0.793         0.792         0.786         0.781         0.786         0.777         0.786         0.753     
##   F                      1146.395       583.290       396.315       300.857       301.971       247.495       265.003       233.554       221.190       210.783       221.190       235.493     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5822.011     -5816.089     -5814.298     -5802.348     -5767.618     -5737.632     -5766.675     -5708.822     -5766.675     -5556.370     
##   Deviance               3112.257      3101.773      3090.247      3082.784      3080.530      3065.535      3022.369      2985.588      3021.206      2950.671      3021.206      2772.590     
##   AIC                   11684.782     11670.255     11654.021     11644.177     11640.596     11618.695     11549.237     11491.264     11549.350     11435.643     11549.350     11132.740     
##   BIC                   11704.272     11696.241     11686.504     11683.157     11679.576     11664.171     11594.713     11543.236     11601.323     11494.113     11601.323     11197.706     
##   N                      4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898         
## ================================================================================================================================================================================================

The highest R squared value achieved was 0.278, which does not amount to a very robust model. Possible explanations could be that either the model is not linear (which some of our analysis points towards) or that there are other features which are affecting the final quality

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this section, I observed the following relationships

  1. Quality vs alcohol & Residual Sugar: The positive correlation between quality and alcohol of the earlier analysis was added another dimension by residual sugar. Lower residual sugar tended to increase the alcohol content, and hence increase the positive impact on quality

  2. Quality vs density & Residual Sugar: Higher residual sugars increased the density, which tended to lower the overall quality of the wine. Hence, residual sugar & density strengthened their negative impact on the overall quality.

  3. Quality vs bound sulfur dioxide & Residual Sugar: Higher residual sugars tended increased the amount of bound sulfur dioxide, which increased the negative impact on the quality of the wine

Were there any interesting or surprising interactions between features?

  1. One surprising interaction was the one between bound sulfur dioxide and residual sugar. It seems higher sugar tends to bind more Sulfur di oxide,negatively impacting the overall quality of the wine

  2. The other interesting interaction was between alcohol, density and residual sugar. Higher sugar correlates with higher density, while higher sugar also correlates with lower alcohol.Given the individual correlations of each of the variables with quality, its makes sense that the best quality wines tends to have low residual sugar, low density & high alcohol.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

A linear model was fitted into the data set which attempts to predict the quality of the wine based on the the other input features.

As the number of input features was increased, the model generally improved, except for certain variables, with the highest R squared value of 0.278 which includes all the variables

This value is not particularly high. As discussed before, this could be because the variable relationships are not linear or that there are some unknown variables also affecting quality. Also, some input variables are internally correlated to each other


Final Plots and Summary

Plot One

Description One

This plot shows the important relationship between alcohol & wine quality. Although it has a positive correlation of 0.436, there seems to be a slight dip in quality around 9-10 % alcohol.

It appears there is an inflection point at quality =5, where the line changes slope. I.e. this means that for lower quality wine (<5), the correlation is negative and for higher quality wine, the correlation is positive. This also means that a simple linear model will not be very accurate in predicting this behavior. A conditional/non linear model may be more suitable.

Plot Two

Description Two

This scatter plot shows the bound sulfur dioxide vs residual sugar & quality.High residual sugar tends to increase the amount of bound sulfur dioxide, as higher sugar tends to covert free Sulfur di oxide into bound sulfur dioxide.

High sugar and consequently higher bound SO2 negatively impacts the quality of the wine, which can be seen in the graph. Higher quality wines generally tend to have lower bound SO2, with none of the highest quality (9) wines having bound SO2> ~ 120

Plot Three

Description Three

This scatter plots shows Alcohol % VS density, with the color associated with the level of residual sugar

Alcohol and density are negatively correlated. In addition, as sugar has itself a high density, higher amounts of sugar tend to increase the overall density of the wine.


Reflection

This project has helped increase both my knowledge of exploratory data analysis using R and of white wines !

This database contained 4898 data points about white wines, with each data point associated with 11 characteristics such as residual sugar, density, alcohol etc along with an overall wine quality score

I started off with getting to understand the data and exploring the given variables through individual variable plots. I then plotted variables against each other and the overall wine quality to better understand the nature and amount of their impact.Finally, selecting those variables which had the maximum impact, I performed advance plotting, visualizing and understanding multiple variables in each plot to draw interesting insights from the data

Certain variables like alcohol had a positive correlation with quality, while certain variables like density had a negative correlation with quality. It was interesting that many of these variables were also affecting each other, e.g. how alcohol and density are negatively correlated. I also created a new variable “bound sulfur dioxide” which eventually uncovered an interesting relationship with residual sugar.

In the end, I created a linear model to try and predict the quality of the wine based on the wine characteristics. The overall robustness of this model did not come out to be very high. I tried to solve this by adding one variable at a time, to try and find a combination which fits best. This process somewhat improved the R squared value, but was still not high enough to “Pop the Champagne” ! Perhaps opening some regular wine would suffice !

This could mean that either the model is not linear or there are other unknown variables affecting the final quality.

It is important to remember that correlation does not imply causation ! Hence, drawing conclusions from correlation alone is not recommended although it can give a starting point from where further analysis can be undertaken. In this project, further steps could be taken to try to fit a non linear model or explore if there are any other unknown variables present.

Nonetheless, it was a learning experience where I picked up knowledge of both the concepts and the tools of Exploratory Data Analysis